DATA SCIENCE PROJECT - SPACESHIP TITANIC

This is the notebook developed for the course Data Science for Economics and Finance by Dario Cagetti (0000918366), Giuseppe Difazio (0000918394), Manuel Rech (0000914421) and Michele Vagnetti (0000918419).

Introduction

Welcome to the year 2912. We’ve received a transmission from four lightyears away and things aren’t looking good. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension! In this competition, your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic’s collision with the spacetime anomaly. To help you make these predictions, you’re given a set of personal records recovered from the ship’s damaged computer system.

  • train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
    • PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
    • HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
    • CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
    • Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
    • Destination - The planet the passenger will be debarking to.
    • Age - The age of the passenger.
    • VIP - Whether the passenger has paid for special VIP service during the voyage.
    • RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic’s many luxury amenities.
    • Name - The first and last names of the passenger.
    • Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
  • test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. The task is to predict the value of Transported for the passengers in this set.

The dataset we have decided to use comes from this Kaggle competition.

Graphing

train_df$HomePlanet = as.factor(train_df$HomePlanet)
train_df$CryoSleep = as.factor(train_df$CryoSleep)
train_df$Destination = as.factor(train_df$Destination)
train_df$VIP = as.factor(train_df$VIP)
train_df$Name = as.factor(train_df$Name)
train_df$Transported = as.factor(train_df$Transported)

Let’s start to get an idea of the dependencies in our dataset. We begin with a mosaic plot using the predictors HomePlanet and CryoSleep.

Categorical

These charts show the number of transported and non-transported passengers as a function of the various predictors. Take the chart of CryoSleep against Transported: it gives us intuition in two ways:

  • looking at the box sizes -> we see that if CryoSleep was True, many more passengers were transported than not. Conversely, if it was False, many more people were not transported than transported.

  • looking at the chart colors -> transparent blocks indicate independence, while blue states that ‘there are more occurrences in this class than in the independent case’ and red that ‘there are fewer occurrences in this class than in the independent case’. Therefore, the more intense the color of a box, the stronger the dependence between that variable and the outcome Transported

The same intuition can be developed for the other categorical variables.

## [1] TRAPPIST-1e   PSO J318.5-22 55 Cancri e   <NA>         
## Levels: 55 Cancri e PSO J318.5-22 TRAPPIST-1e

Numerical

To get some intuition about what the raw variables look like, here is the summary of the numeric predictors.

##       Age         RoomService        FoodCourt        ShoppingMall    
##  Min.   : 0.00   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:19.00   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :27.00   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :28.83   Mean   :  224.7   Mean   :  458.1   Mean   :  173.7  
##  3rd Qu.:38.00   3rd Qu.:   47.0   3rd Qu.:   76.0   3rd Qu.:   27.0  
##  Max.   :79.00   Max.   :14327.0   Max.   :29813.0   Max.   :23492.0  
##  NA's   :179     NA's   :181       NA's   :183       NA's   :208      
##       Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0  
##  Mean   :  311.1   Mean   :  304.9  
##  3rd Qu.:   59.0   3rd Qu.:   46.0  
##  Max.   :22408.0   Max.   :24133.0  
##  NA's   :183       NA's   :188

By looking at this graph we can notice that people older than 30 were transported with pretty much the same distribution as the non-transported. Below that age, people in their twenties appear to have been sacrificed, probably to help younger and older passengers. Indeed, we can also see how different the distribution is for those younger than about 10–15: many more people of this age were transported.

In both violin plots we can see almost the same behaviour: not much difference between non-transported and transported, with the orange observations being a bit thicker above about 50. This means that few passengers spent a lot on these extra services, and also that spending more did not ensure preferential treatment when the accident occurred.

Here we can notice a few things. First, among all the people who spent less than 30 on the VRDeck services, the majority were not transported. In addition, here too we see thicker orange values above 50, meaning that few passengers spent a lot on this service and that, again, spending more did not mean better treatment in case of casualty.

As in the previous case, we observe a similar distribution in both plots, yet with a different pattern: passengers who spent very little were hardly ever transported, while people who spent a moderate sum on these services had a higher probability of being transported. For the third time we can spot thicker orange values beyond most of the observations, meaning, again, that few passengers spent a lot on these services and that spending more did not mean better treatment in case of casualty.

Data pre-processing

To start out, let’s see what the initial dataset looks like.

head(train_df)

Feature engineering

From the information reported about the variables, we can notice that two columns, PassengerId and Cabin, hide some information that may be exploited by splitting them into multiple columns.

Firstly, we split PassengerId, built as ‘gggg_pp’:

head(train_df$PassengerId)
## [1] "0001_01" "0002_01" "0003_01" "0003_02" "0004_01" "0005_01"
  • group is the ‘gggg’ part and refers to the group the passenger belongs to; think of a family as a group

  • number_within_group is the ‘pp’ part and refers to the passenger’s number within the group

Secondly, we split the Cabin column, built as ‘deck/num/side’:

head(train_df$Cabin)
## [1] "B/0/P" "F/0/S" "A/0/S" "A/0/S" "F/1/S" "F/0/P"
  • side can be either P for Port or S for Starboard

  • deck refers to the position on the ship

  • num is a further categorization of the place taken for travelling
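These two splits can be sketched in base R as follows (a minimal sketch: the column names match the description above, and NA handling is omitted for brevity; the notebook’s actual code may differ):

```r
# Split PassengerId ("gggg_pp") into group and number_within_group
id_parts = do.call(rbind, strsplit(as.character(train_df$PassengerId), "_", fixed = TRUE))
train_df$group = id_parts[, 1]
train_df$number_within_group = as.factor(id_parts[, 2])

# Split Cabin ("deck/num/side") into deck, num and side
cabin_parts = do.call(rbind, strsplit(as.character(train_df$Cabin), "/", fixed = TRUE))
train_df$deck = as.factor(cabin_parts[, 1])
train_df$num  = cabin_parts[, 2]
train_df$side = as.factor(cabin_parts[, 3])
```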

Let’s draw the mosaic plots also for these three newly created variables.

Elimination of irrelevant variables

After pre-processing, the datasets are loaded into the R environment, and since the categorical variables by themselves are just lists of characters, it is appropriate to convert them into factors. However, some factors have too many levels, sometimes almost as many as there are rows.

As intuition suggests, the variable Name is likely irrelevant for predicting whether a passenger was transported; indeed, it has 8473 levels in a dataset with 8693 observations. For this reason we drop it.

A similar reasoning applies to the variable Num, with 1817 levels, derived from the old variable Cabin (which was decomposed as Deck/Num/Side). Notice that Deck has 8 levels and Side 2 instead.

The variable Group, with 6217 levels, is also dropped. It was derived from the PassengerId variable, which was split into Group and number_within_group, the latter having 8 levels.

Handling missing values

After this preliminary manual selection of the predictors, we want to count the missing values and handle them appropriately. This is how many NA values we have per variable:

## [1] "train dataframe missing values"
##          HomePlanet           CryoSleep         Destination                 Age 
##                 201                 217                 182                 179 
##                 VIP         RoomService           FoodCourt        ShoppingMall 
##                 203                 181                 183                 208 
##                 Spa              VRDeck         Transported                deck 
##                 183                 188                   0                 199 
##                side number_within_group 
##                 199                   0
## [1] "test dataframe missing values"
##          HomePlanet           CryoSleep         Destination                 Age 
##                  87                  93                  92                  91 
##                 VIP         RoomService           FoodCourt        ShoppingMall 
##                  93                  82                 106                  98 
##                 Spa              VRDeck                deck                side 
##                 101                  80                 100                 100 
## number_within_group 
##                   0

Now, using the library ImputeMissing, we substitute the missing values using the following criteria:

  • median for numerical variables

  • mode for categorical variables
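The same criteria can be sketched in base R (a hypothetical helper, shown only to make the imputation rule explicit; the notebook itself relies on the imputation library mentioned above):

```r
# Hypothetical helper: median for numeric columns, mode for the rest
impute_simple = function(df) {
  for (col in names(df)) {
    miss = is.na(df[[col]])
    if (!any(miss)) next
    if (is.numeric(df[[col]])) {
      df[[col]][miss] = median(df[[col]], na.rm = TRUE)
    } else {
      df[[col]][miss] = names(which.max(table(df[[col]])))  # most frequent level
    }
  }
  df
}
```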

## [1] "train dataframe missing values after imputation"
##          HomePlanet           CryoSleep         Destination                 Age 
##                   0                   0                   0                   0 
##                 VIP         RoomService           FoodCourt        ShoppingMall 
##                   0                   0                   0                   0 
##                 Spa              VRDeck         Transported                deck 
##                   0                   0                   0                   0 
##                side number_within_group 
##                   0                   0
## [1] "test dataframe values after imputation"
##          HomePlanet           CryoSleep         Destination                 Age 
##                   0                   0                   0                   0 
##                 VIP         RoomService           FoodCourt        ShoppingMall 
##                   0                   0                   0                   0 
##                 Spa              VRDeck                deck                side 
##                   0                   0                   0                   0 
## number_within_group 
##                   0

Scaling the data

##       Age         RoomService      FoodCourt        ShoppingMall    
##  Min.   : 0.00   Min.   :    0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:20.00   1st Qu.:    0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :27.00   Median :    0   Median :    0.0   Median :    0.0  
##  Mean   :28.79   Mean   :  220   Mean   :  448.4   Mean   :  169.6  
##  3rd Qu.:37.00   3rd Qu.:   41   3rd Qu.:   61.0   3rd Qu.:   22.0  
##  Max.   :79.00   Max.   :14327   Max.   :29813.0   Max.   :23492.0  
##       Spa              VRDeck       
##  Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :    0.0   Median :    0.0  
##  Mean   :  304.6   Mean   :  298.3  
##  3rd Qu.:   53.0   3rd Qu.:   40.0  
##  Max.   :22408.0   Max.   :24133.0

Here we can see the numeric predictors for this dataset: there are 5 predictors regarding the expenses incurred by passengers on the spaceship, and 1 predictor indicating the passengers’ age. As these predictors differ in both range and mean, we proceed with scaling them to reach mean = 0 and standard deviation equal to 1.

train_df[ , numeric_cols_train] = scale(train_df[, numeric_cols_train])
test_df[ , numeric_cols_test] = scale(test_df[, numeric_cols_test])
summary(train_df[, numeric_cols_train])
##       Age           RoomService        FoodCourt        ShoppingMall    
##  Min.   :-2.0075   Min.   :-0.3331   Min.   :-0.2810   Min.   :-0.2836  
##  1st Qu.:-0.6129   1st Qu.:-0.3331   1st Qu.:-0.2810   1st Qu.:-0.2836  
##  Median :-0.1248   Median :-0.3331   Median :-0.2810   Median :-0.2836  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5724   3rd Qu.:-0.2710   3rd Qu.:-0.2428   3rd Qu.:-0.2468  
##  Max.   : 3.5010   Max.   :21.3574   Max.   :18.4013   Max.   :39.0003  
##       Spa              VRDeck       
##  Min.   :-0.2706   Min.   :-0.2630  
##  1st Qu.:-0.2706   1st Qu.:-0.2630  
##  Median :-0.2706   Median :-0.2630  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2235   3rd Qu.:-0.2277  
##  Max.   :19.6377   Max.   :21.0159

Train set/validation set creation

Now that the dataset has been processed, we are ready to try some machine learning models to understand which one is best at making predictions on unknown data.
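A stratified split that preserves the class proportions can be sketched with caTools (the 3/5 ratio is an assumption, matching the split used later in the regularization section; it is what makes the two means reported below so close):

```r
library(caTools)  # sample.split keeps the class proportions in both parts
set.seed(1)
in_train = sample.split(train_df$Transported, SplitRatio = 3/5)
train_set      = train_df[in_train, ]
validation_set = train_df[!in_train, ]
mean(train_set$Transported == "True")
mean(validation_set$Transported == "True")
```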

## The mean of transported in train_set is 0.5036426  while the mean of transported in validation_set is 0.5035951 , so the division seems to be done evenly
head(train_df)

Subset selection

Now, we know we have a classification problem, not a regression one, so methods such as best subset, forward and backward stepwise selection are not the best choice; however, they can give a sense of which variables may be more relevant. As we’ll see, some of the findings here are in line with the results of other classifiers.

We are using the library leaps for this simple regression, and the choice of selection method falls on FSS. This is because forward, backward and best subset selection yield very similar results here, and we therefore choose the least computationally expensive one.

Linear Regression

As we know from theory, when we fit a linear regression model we minimize the Residual Sum of Squares on the training set, and consequently the Mean Squared Error (MSE = RSS/n). However, since we care about model accuracy, that is, the ability of the regression to fit unseen data, we do not pay too much attention to the training MSE and prefer the test MSE instead. The former usually underestimates the latter (the training MSE is monotonically decreasing in flexibility while the test MSE is U-shaped), and therefore we use different approaches to estimate the test MSE: Mallow’s Cp, the Bayesian Information Criterion and the adjusted R² coefficient.

We are not going into the details of these methods; it is sufficient to state that, after performing the same steps with the different criteria, we ended up with very similar results, so we only show the results for the Cp criterion here.

## model estimation
forward_subset_selection = regsubsets(Transported ~ ., data = train_set, nvmax = 27, method = "forward")
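From the fitted object, the Cp values can be extracted from the regsubsets summary and the best model size located, for example:

```r
fss_summary = summary(forward_subset_selection)
plot(fss_summary$cp, type = "b", xlab = "Number of predictors", ylab = "Mallow's Cp")
which.min(fss_summary$cp)  # size of the model minimizing Cp
```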

Logistic regression

However, as we have a classification task, it is more proper to use a logistic regression instead of a linear one. Furthermore, we also perform some subset selection using the function step, which by default performs backward stepwise selection.

set.seed(1)
logistic = glm(Transported ~ ., data=train_set, family = binomial())

bestlogit <- step(logistic, trace = FALSE)

bestlogit$anova

Calling anova on the bestlogit model shows that the predictors ‘VIP’ and ‘number within group’ are not relevant for the classification and are therefore omitted from the optimal estimated model.

summary(bestlogit)
## 
## Call:
## glm(formula = Transported ~ HomePlanet + CryoSleep + Destination + 
##     Age + RoomService + FoodCourt + ShoppingMall + Spa + VRDeck + 
##     deck + side, family = binomial(), data = train_set)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9864  -0.6692   0.0372   0.6982   3.2356  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.18367    0.38244  -5.710 1.13e-08 ***
## HomePlanetEuropa          1.61778    0.27410   5.902 3.59e-09 ***
## HomePlanetMars            0.49322    0.11900   4.145 3.40e-05 ***
## CryoSleepTrue             1.42710    0.10110  14.116  < 2e-16 ***
## DestinationPSO J318.5-22 -0.30240    0.14440  -2.094 0.036245 *  
## DestinationTRAPPIST-1e   -0.33654    0.10337  -3.256 0.001131 ** 
## Age                      -0.13225    0.03834  -3.449 0.000562 ***
## RoomService              -1.00264    0.07976 -12.571  < 2e-16 ***
## FoodCourt                 0.82014    0.08312   9.866  < 2e-16 ***
## ShoppingMall              0.33845    0.04983   6.792 1.10e-11 ***
## Spa                      -2.35344    0.15630 -15.058  < 2e-16 ***
## VRDeck                   -2.17879    0.15515 -14.043  < 2e-16 ***
## deckB                     1.10984    0.33127   3.350 0.000807 ***
## deckC                     2.45846    0.37218   6.606 3.96e-11 ***
## deckD                     0.82780    0.36680   2.257 0.024020 *  
## deckE                     0.09739    0.37001   0.263 0.792391    
## deckF                     0.88112    0.37057   2.378 0.017419 *  
## deckG                     0.46343    0.38094   1.217 0.223776    
## deckT                    -0.16451    1.92734  -0.085 0.931977    
## sideS                     0.58435    0.07464   7.829 4.90e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7230.6  on 5215  degrees of freedom
## Residual deviance: 4471.1  on 5196  degrees of freedom
## AIC: 4511.1
## 
## Number of Fisher Scoring iterations: 7

Also, as we can see from this summary, the deck predictor is encoded with one dummy fewer than its number of levels to avoid collinearity. However, the estimates for the levels G, T and E are not considered very significant. To exploit all the predictive power of the model, let’s try joining these three levels of the factor ‘deck’ and re-estimating the model.

The new level EGT now comprises the three levels E, G and T.
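One way to merge the levels, assuming a copy of the training set named train_set_logistic (the name used in the code below), is the base-R levels replacement, where duplicated level names merge:

```r
train_set_logistic = train_set
lv = levels(train_set_logistic$deck)
lv[lv %in% c("E", "G", "T")] = "EGT"   # the three weak levels collapse into one
levels(train_set_logistic$deck) = lv   # duplicated names merge the levels
table(train_set_logistic$deck)
```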

set.seed(1)
logistic2 = glm(Transported ~ ., data=train_set_logistic, family = binomial())

bestlogit2 <- step(logistic2,  trace = FALSE)

summary(bestlogit2)
## 
## Call:
## glm(formula = Transported ~ HomePlanet + CryoSleep + Destination + 
##     Age + RoomService + FoodCourt + ShoppingMall + Spa + VRDeck + 
##     deck + side, family = binomial(), data = train_set_logistic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9181  -0.6679   0.0371   0.6920   3.1951  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.28904    0.37543  -6.097 1.08e-09 ***
## HomePlanetEuropa          1.76549    0.26640   6.627 3.42e-11 ***
## HomePlanetMars            0.71184    0.10511   6.772 1.27e-11 ***
## CryoSleepTrue             1.28408    0.09386  13.680  < 2e-16 ***
## DestinationPSO J318.5-22 -0.32554    0.14482  -2.248 0.024589 *  
## DestinationTRAPPIST-1e   -0.35353    0.10318  -3.427 0.000611 ***
## Age                      -0.11220    0.03783  -2.966 0.003019 ** 
## RoomService              -1.00109    0.07966 -12.567  < 2e-16 ***
## FoodCourt                 0.81332    0.08294   9.806  < 2e-16 ***
## ShoppingMall              0.34373    0.05010   6.861 6.82e-12 ***
## Spa                      -2.33047    0.15505 -15.031  < 2e-16 ***
## VRDeck                   -2.15059    0.15382 -13.981  < 2e-16 ***
## deckB                     1.13252    0.32652   3.468 0.000523 ***
## deckC                     2.42304    0.36633   6.614 3.73e-11 ***
## deckD                     0.77079    0.36028   2.139 0.032403 *  
## deckF                     0.14513    0.36383   0.399 0.689983    
## deckEGT                   0.81499    0.36401   2.239 0.025161 *  
## sideS                     0.58644    0.07452   7.870 3.55e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7230.6  on 5215  degrees of freedom
## Residual deviance: 4487.6  on 5198  degrees of freedom
## AIC: 4523.6
## 
## Number of Fisher Scoring iterations: 7
bestlogit2$anova

Also, by calling the anova function again on the bestlogit2 model, we see that the same two predictors are eliminated; therefore we can assume that, under these criteria, we have reached an optimum for the logistic regression.
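The AUC reported below can be obtained from the predicted probabilities on the validation set, e.g. with the pROC package (a sketch; the object names are assumptions):

```r
library(pROC)
probs = predict(bestlogit, newdata = validation_set, type = "response")
auc(roc(validation_set$Transported, probs))
```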

## Logistic AUC:  0.8727135

Shrinkage methods

Given that subset selection generally leads to over-fitting, let us assess the performance of new models based on regularization methods, which reduce variance by adding a penalty term.

In order to fit our models using Ridge, Lasso and Elastic Net, we need to create four objects, two design matrices for the x-values and two response vectors for the y-values, containing the observations of the train and test sets, which have been split randomly using a stratified sampling technique.

set.seed(1)
X = model.matrix(Transported ~ . -1, data = train_df) 
y = train_df$Transported

set.seed(1)
train.label = sample.split(train_df$Transported, SplitRatio = 3/5)
X.train = X[train.label, ] 
y.train = y[train.label]  
X.test = X[!train.label, ]  # sample.split returns a logical vector, so negate with !
y.test = y[!train.label]

y.train = as.numeric(y.train) -1 
y.test = as.numeric(y.test) - 1

Regularization methods minimize the RSS plus a penalty term called the shrinkage penalty, which has the effect of shrinking the estimates of the model’s parameters towards zero. The amount of penalty is tuned by the parameter lambda, which can take any value in the range [0, +∞): 0 represents the case where the outcome is exactly equal to an OLS regression, and +∞ the case where only the intercept remains in the model, that is, the mean value of the response.

Ridge

We are using the elastic net formulation of the penalty function, which weights the lasso and ridge penalties by the coefficients α and (1−α), respectively. Thus, to obtain a ridge regression we set α equal to 0.

Unlike OLS, ridge regression produces different parameter estimates depending on the value of lambda; thus, we cross-validate its value to choose the best model.

set.seed(1)
cv.ridge = cv.glmnet(X.train, y.train, alpha = 0, family=binomial()) 
# alpha = 0 is for Ridge, alpha = 1 is for Lasso

plot(cv.ridge)

coef(cv.ridge)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)              -0.591685071
## HomePlanetEarth          -0.406999503
## HomePlanetEuropa          0.502547785
## HomePlanetMars            0.041909340
## CryoSleepTrue             1.506322382
## DestinationPSO J318.5-22 -0.171274572
## DestinationTRAPPIST-1e   -0.221643907
## Age                      -0.123017919
## VIPTrue                  -0.224804558
## RoomService              -0.557053667
## FoodCourt                 0.294931913
## ShoppingMall              0.245987054
## Spa                      -0.747441128
## VRDeck                   -0.737830686
## deckB                     0.489601986
## deckC                     0.785854822
## deckD                     0.011471137
## deckE                    -0.476742899
## deckF                     0.081439620
## deckG                    -0.130125920
## deckT                     0.147991266
## sideS                     0.415878176
## number_within_group2      0.162559729
## number_within_group3      0.343219701
## number_within_group4      0.051401794
## number_within_group5     -0.243078487
## number_within_group6      0.004613333
## number_within_group7      0.230267703
## number_within_group8     -0.032922557

Given the bias-variance trade-off, we select the best model under the one-standard-error rule: rather than the model with the lowest cross-validated error, we take the simplest one within one standard error of it.
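With cv.glmnet both candidate penalties are stored, so the rule reduces to using lambda.1se instead of lambda.min (which coef() does by default):

```r
cv.ridge$lambda.min                # lambda with the lowest cross-validated deviance
cv.ridge$lambda.1se                # largest lambda within one standard error of it
coef(cv.ridge, s = "lambda.1se")   # the model selected by the 1-SE rule
```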

## Ridge:  0.8690364

Lasso

The Lasso regression is computed by selecting α = 1, which is the default in glmnet. Unlike Ridge, Lasso performs variable selection, which means that the coefficient estimates of some insignificant variables are exactly zero. By providing a sparse model, the Lasso regression improves interpretability over the Ridge one, even though this does not mean it is necessarily better at predicting data.

set.seed(1)
cv.lasso = cv.glmnet(X.train, y.train, family=binomial())
plot(cv.lasso)

coef(cv.lasso)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                   s1
## (Intercept)              -0.77011976
## HomePlanetEarth          -0.44771369
## HomePlanetEuropa          0.71969559
## HomePlanetMars            .         
## CryoSleepTrue             1.45807539
## DestinationPSO J318.5-22 -0.03475148
## DestinationTRAPPIST-1e   -0.12483670
## Age                      -0.08323170
## VIPTrue                   .         
## RoomService              -0.76957200
## FoodCourt                 0.48946519
## ShoppingMall              0.26887562
## Spa                      -1.47885917
## VRDeck                   -1.35225864
## deckB                     0.29279437
## deckC                     1.01757061
## deckD                     .         
## deckE                    -0.53289611
## deckF                     .         
## deckG                    -0.23840332
## deckT                     .         
## sideS                     0.45061428
## number_within_group2      0.10493157
## number_within_group3      0.25154842
## number_within_group4      .         
## number_within_group5     -0.02388122
## number_within_group6      .         
## number_within_group7      .         
## number_within_group8      .

For the bias-variance trade-off, we again select the best model under the one-standard-error rule. As can be seen in the graph, this not only decreases the variance of the model but also allows us to simplify it, using “just” 18 variables rather than 23, improving overall interpretability.

## Lasso:  0.8767769

Elastic net

The elastic net regression is performed by assigning α a value in the range (0, 1); in this case we chose the midpoint α = 0.5, even though a more thorough analysis would cross-validate the parameter α to find the best model.
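Such a cross-validation of α could be sketched as a grid search over cv.glmnet fits (the grid and the shared fold assignment are our choices; a fixed foldid keeps the comparison across α fair):

```r
set.seed(1)
folds = sample(rep(1:10, length.out = nrow(X.train)))  # one fold assignment shared by all fits
alphas = seq(0, 1, by = 0.1)
cv_dev = sapply(alphas, function(a) {
  fit = cv.glmnet(X.train, y.train, alpha = a, family = binomial(), foldid = folds)
  min(fit$cvm)  # best cross-validated deviance for this alpha
})
alphas[which.min(cv_dev)]  # alpha with the lowest CV error
```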

set.seed(1)
cv.elnet = cv.glmnet(X.train, y.train, alpha = 0.5, family = binomial())
plot(cv.elnet)

coef(cv.elnet)
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                   s1
## (Intercept)              -0.75523336
## HomePlanetEarth          -0.45343649
## HomePlanetEuropa          0.65912155
## HomePlanetMars            .         
## CryoSleepTrue             1.51841231
## DestinationPSO J318.5-22 -0.14023416
## DestinationTRAPPIST-1e   -0.20032414
## Age                      -0.10541044
## VIPTrue                  -0.02410208
## RoomService              -0.77057660
## FoodCourt                 0.49048361
## ShoppingMall              0.29495620
## Spa                      -1.38635880
## VRDeck                   -1.29555362
## deckB                     0.44087386
## deckC                     1.12262722
## deckD                     .         
## deckE                    -0.53210470
## deckF                     0.06090069
## deckG                    -0.22368901
## deckT                     .         
## sideS                     0.47994901
## number_within_group2      0.14955858
## number_within_group3      0.31832917
## number_within_group4      .         
## number_within_group5     -0.17064692
## number_within_group6      .         
## number_within_group7      .         
## number_within_group8      .

Following the same approach, we used the one-standard-error rule. As can be seen from the graph, the elastic net has some variable-selection properties and allows us to decrease the number of variables in the model from 25 to 21, even though, as expected, it performs slightly worse than the Lasso, which eliminates three more variables.

## Elastic net alpha 0.5:  0.8763227

Tree methods

Single tree

To introduce tree-based methods, we start with the simplest case: the single tree.

set.seed(1)
tree.train = tree(Transported ~ ., train_set)
plot(tree.train)
text(tree.train, pretty = 0)

tree.pred = predict(tree.train, validation_set, type = "vector")

# cv.tree.train = cv.tree(tree.train, FUN = prune.misclass)
# 
# plot(cv.tree.train$size, cv.tree.train$dev,
#      type = 'b', pch = 19, col = 'red',
#      xlab = 'Tree size', ylab = 'missclassification')
# 
# prune.train = prune.tree(tree.train, best = 8)
# plot(prune.train)
# text(prune.train, pretty = 0)

## [1] 0.8312572

However, this method will rarely perform as well as the Random Forest, an ensemble method that combines many such trees.

Random Forest

Using the library randomForest, we estimate a random forest with the default value of 500 trees and, as the documentation says, an mtry equal to the square root of the number of columns rounded down to an integer: 3.7 rounded down to 3. As we’ll see, this is already quite a good result.

set.seed(1)
#sqrt(ncol(train_set))
rf = randomForest(Transported ~ ., data = train_set)
rf
## 
## Call:
##  randomForest(formula = Transported ~ ., data = train_set) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.69%
## Confusion matrix:
##       False True class.error
## False  2003  586   0.2263422
## True    441 2186   0.1678721

Now we want to fine-tune the number of variables used in each tree of the Random Forest. For this we could use the function tuneRF from the randomForest library, but we think it is not very robust, as it may get stuck in a misleading local optimum.

We proceed instead by estimating as many random forest models as there are predictors, changing mtry at every iteration. From the resulting graph, we pick the mtry (number of predictors tried at each split) associated with the lowest out-of-bag error and/or test error.
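That loop can be sketched as follows (13 predictors assumed, matching the bagging model later; the out-of-bag error is used as the criterion):

```r
set.seed(1)
oob_err = numeric(13)
for (m in 1:13) {
  rf_m = randomForest(Transported ~ ., data = train_set, mtry = m, ntree = 500)
  oob_err[m] = rf_m$err.rate[500, "OOB"]  # OOB error after all 500 trees
}
plot(1:13, oob_err, type = "b", xlab = "mtry", ylab = "OOB error")
which.min(oob_err)
```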

As we can see, mtry = 2 is the optimal value; let’s now estimate the best random forest.

set.seed(1)
rf2 = randomForest(Transported ~ ., data = train_set, mtry = 2, ntree = 500)
rf2
## 
## Call:
##  randomForest(formula = Transported ~ ., data = train_set, mtry = 2,      ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 19.82%
## Confusion matrix:
##       False True class.error
## False  2008  581    0.224411
## True    453 2174    0.172440

Here is also the chart of the most important variables, using the mean decrease in the Gini index as the criterion.

These are the ROC curves for the two Random Forests,

with AUC values of

## RandomForest (m = 3):  0.8641943

## RandomForest (m = 2):  0.8579669
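These values can be computed along the following lines (a sketch assuming the pROC package and the `validation_set` held out earlier; the notebook may have used a different ROC library). The same code with `rf2` gives the m = 2 value:

```r
library(pROC)

# predicted probabilities of class "True" on the held-out validation set
probs <- predict(rf, newdata = validation_set, type = "prob")[, "True"]
roc_rf <- roc(response = validation_set$Transported, predictor = probs)
plot(roc_rf, main = "ROC - Random Forest (m = 3)")
auc(roc_rf)
```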

Bagging

Still within tree-based methods, we can use a bagging algorithm. This follows the same logic as the random forest classifier; the only difference is the number of variables considered at each split, which in bagging equals the full number of predictors (here, 13).

set.seed(1)
bagging = randomForest(Transported ~ ., data = train_set, mtry = 13, ntree = 500)

Let’s now plot the ROC curves for these models.

Let’s take a look at the Area Under the Curve for these two models.

##  Bagging      (m = 13):  0.8568759

Boosting

Now let’s implement a boosting algorithm. This is another tree-based ensemble method but, unlike Random Forests and Bagging, it learns slowly: each tree has very few terminal nodes, and each subsequent tree is fit to the residuals of the current ensemble, so the model keeps improving in the regions where it performs worst.
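The slow-learning idea can be illustrated in a few lines of base R (a toy sketch with squared-error loss, not the actual gbm algorithm used below): each tiny tree, here a single-split stump, fits the residuals of the current ensemble, and only a shrunken fraction of its prediction is added.

```r
# Toy boosting illustration: stumps fit to residuals, shrunken updates
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

shrinkage <- 0.1
pred <- rep(0, length(y))
for (b in 1:200) {
  res <- y - pred                                   # current residuals
  # best single split on x, found by brute force over candidate cuts
  cuts <- quantile(x, probs = seq(0.05, 0.95, 0.05))
  sse <- sapply(cuts, function(cut) {
    left <- x < cut
    sum((res[left] - mean(res[left]))^2) +
      sum((res[!left] - mean(res[!left]))^2)
  })
  best <- cuts[which.min(sse)]
  left <- x < best
  step <- ifelse(left, mean(res[left]), mean(res[!left]))
  pred <- pred + shrinkage * step                   # small, slow update
}
mean((y - pred)^2)   # training MSE falls well below var(y)
```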

Using the cv.folds argument of the gbm function, we can perform cross-validation to select the number of trees that gives the best predictive ability.

set.seed(1)
cv.boost.train = gbm(Transported ~ ., 
                     data = train_set, 
                     distribution = 'bernoulli', 
                     n.trees = 10000, 
                     shrinkage = 0.01, 
                     interaction.depth = 2,
                     cv.folds = 10,
                     n.cores = 2)

# set.seed(1)
# cv.boost.train1 = gbm(Transported ~ ., 
#                      data = train_set, 
#                      distribution = 'bernoulli', 
#                      n.trees = 10000, 
#                      shrinkage = 0.01, 
#                      interaction.depth = 1,
#                      cv.folds = 10,
#                      n.cores = 2)

best.iter = gbm.perf(cv.boost.train, method = "cv")
# best.iter1 = gbm.perf(cv.boost.train1, method = "cv")
legend("topright", legend = c("Train","Validation"), 
       pch = 19, col = c("black","Yellow"))

cat('The optimal number of trees is: ', best.iter)
## The optimal number of trees is:  5690

We now fit the boosting algorithm using this number of trees

set.seed(1)
best.boost.train = gbm(Transported ~ ., 
                       data = train_set, 
                       distribution = 'bernoulli', 
                       n.trees = best.iter, 
                       interaction.depth = 2,
                       shrinkage = 0.01)
# best.boost.train1 = gbm(Transported ~ ., 
#                        data = train_set, 
#                        distribution = 'bernoulli', 
#                        n.trees = best.iter, 
#                        interaction.depth = 1,
#                        shrinkage = 0.01)

And this is the AUC of the boosted model on the validation set:

## [1] 0.882844
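One gbm-specific detail: n.trees must be passed explicitly at prediction time, and `type = "response"` converts the Bernoulli log-odds to probabilities. A sketch of how the value above can be obtained (again assuming `validation_set` and the pROC package):

```r
library(pROC)

# probabilities of Transported == True from the boosted model
boost_probs <- predict(best.boost.train, newdata = validation_set,
                       n.trees = best.iter, type = "response")
auc(roc(validation_set$Transported, boost_probs))
```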

SVM

The Support Vector Machine (SVM) classifier for binary problems is a machine learning technique that separates the predictor space with a hyperplane, so that, as far as possible, all observations of one class fall on one side and all observations of the other class on the other.

As in most machine learning models, there is a parameter to tune in order to constrain flexibility and obtain the lowest possible test error. Here it is the cost parameter, which controls the penalty for misclassification: in other words, how many misclassified observations we tolerate so that we neither overfit nor underfit the data.

set.seed(1)
# run 10-CV
# linear_svm = tune(svm, Transported ~ ., data = train_set, 
#                kernel = "linear", 
#                scale = TRUE, 
#                ranges = list(cost = c(0.1, 1, 5,6,7,8,9,10,11,12,13,14,15,20)))
load(file='tune_out_linear.Rdata')
tune.outL$best.model
## 
## Call:
## best.tune(method = svm, train.x = Transported ~ ., data = train_set, 
##     ranges = list(cost = c(0.1, 1, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
##         14, 15, 20)), kernel = "linear", scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  6 
## 
## Number of Support Vectors:  2508
plot(tune.outL, main='error as a function of cost, linear svm')

As we can see from the chart and from the table above, the best SVM classifier is reached with the cost parameter = 6. Using this we can make predictions on the validation_set and plot the usual ROC curve and AUC.

## [1] 0.8615537
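An SVM predicts class labels, but a ROC curve needs continuous scores; with e1071 these come from the decision values. A sketch of how the AUC above can be computed (assuming `validation_set` and the pROC package):

```r
library(pROC)

# ask predict.svm for the signed distances to the separating hyperplane
svm_pred <- predict(tune.outL$best.model, newdata = validation_set,
                    decision.values = TRUE)
scores <- as.numeric(attributes(svm_pred)$decision.values)
auc(roc(validation_set$Transported, scores))
```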

To account for non-linear boundaries we may try to use non-linear specifications of the SVM classifier.

Non linear Support Vector

Specifically, we decided to use the radial kernel. In this case the SVM is no longer a global method as the linear SVC was: unknown observations are classified using known observations that are nearby in the predictor space. This is somewhat reminiscent of the KNN classifier, with the difference that we draw boundaries in the dataset rather than classifying observation by observation.
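This locality comes from the radial kernel itself, K(x, x′) = exp(−γ‖x − x′‖²): observations far from x′ receive an exponentially vanishing weight. In R:

```r
# radial (RBF) kernel: weight decays exponentially with squared distance
radial_kernel <- function(x, x_prime, gamma) {
  exp(-gamma * sum((x - x_prime)^2))
}
radial_kernel(c(0, 0), c(0, 0), gamma = 0.5)  # 1: identical points
radial_kernel(c(0, 0), c(3, 4), gamma = 0.5)  # exp(-12.5), essentially 0
```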

set.seed(1)
# cross-validating the tuning parameters gamma and lambda 
# svm_non_linear = tune(svm, Transported ~ ., data = train_set,
#                 kernel = "radial",
#                 ranges = list(cost = c(0.1,1,5,10,15,10,25,30),
#                               gamma = c(0.5, 1, 2, 3, 4)))
load('svm_non_linear.Rdata')
cat('Below there are the best parameters')
## Below there are the best parameters
svm_non_linear$best.parameters
plot(svm_non_linear, main = 'Optimal parameters for non-linear radial SVM')

However, we notice that the best picks for cost and gamma are the lowest values in the grid, cost = 0.1 and gamma = 0.5. Let’s plot the ROC curve and compute the AUC anyway.

## False  True 
##  1755  1722

## [1] 0.8518311

Neural Network

In order to fit the Neural Network we use the function neuralnet; since it works only with quantitative variables, we need to convert the factor variables into numerical dummies by one-hot encoding the datasets.

Firstly, we create a dataframe containing the categorical variables; then we apply the function dummyVars, which converts factors into numerical dummy variables stored in a new dataset. To avoid collinearity issues we delete one dummy for each categorical variable. Finally, we add back the numerical variables from the original dataset.

The O-H encoding is performed in the following order: 1. train_set, 2. validation_set 3. test_df.
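The encoding step itself is not shown; a sketch using caret's dummyVars, where `fullRank = TRUE` implements the "delete one dummy per factor" rule (identifiers assume the notebook's data frames):

```r
library(caret)

# fullRank = TRUE drops the first level of each factor, avoiding collinearity
dv <- dummyVars(~ ., data = train_set, fullRank = TRUE)
train_set_new <- as.data.frame(predict(dv, newdata = train_set))
```

The same `dv` object is then applied to `validation_set` and `test_df` so all three share an identical column layout.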

Using neuralnet we fit the model with all the variables in the dataset as covariates, as shown by the formula Transported.True ~ . within the model. Since we face a classification problem, we minimize ce, the cross-entropy, which is the negative of the log-likelihood, and use the logistic activation function because the outputs should be probabilities and must therefore lie in [0, 1].

Given the huge number of tuning parameters in this kind of model and the computational cost of cross-validating to find their optimal combination, we decided to estimate just a one-hidden-layer network.

Selecting a single hidden layer with 14 nodes, we need to estimate (27 + 1) x 14 = 392 weights for the hidden layer alone (plus 14 + 1 = 15 for the output node)! This means there is a tangible risk of over-fitting the data, which also stems from the huge variability across different combinations of tuning parameters.
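The count is quick to verify: each of the 14 hidden nodes receives the 27 encoded inputs plus a bias, and the single output node receives the 14 hidden activations plus a bias.

```r
inputs <- 27                        # one-hot encoded predictors
hidden <- 14                        # nodes in the single hidden layer
(inputs + 1) * hidden               # 392 weights into the hidden layer
(inputs + 1) * hidden + hidden + 1  # 407 parameters including the output node
```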

library(neuralnet)

set.seed(2020)
nn1 = neuralnet(Transported ~.,                      
                data = train_set_new,      
                stepmax = 1e+7,                  
                hidden = c(14),                   
                algorithm = "rprop+",             
                err.fct = "ce",                  
                act.fct = "logistic",            
                linear.output = FALSE,             
                lifesign = 'full',               
                lifesign.step = 1000,             
                threshold = 1,                    
                rep = 3)

Here we can see the ROC curves and the performances for the Neural Networks

## Neural Network rep = 1:  0.8681485
## Neural Network rep = 2:  0.8591009
## Neural Network rep = 3:  0.8576715

Let’s provide a graphical representation of the first network, the best-performing model with an AUC of 0.868, showing the resulting coefficients for each node connection. As the graph makes clear, it is extremely hard to interpret the connections between the variables; this, together with the relatively low prediction results, led us to prefer simpler and less computationally expensive models.

Conclusions

##                   Rank       AUC
## 1             Boosting 0.8824744
## 2          Elastic net 0.8765516
## 3                Lasso 0.8764930
## 4  Logistic regression 0.8727135
## 5                Ridge 0.8684710
## 6         Neural net 1 0.8681485
## 7  Random forest m = 3 0.8641943
## 8           Linear SVM 0.8615537
## 9         Neural net 2 0.8591009
## 10 Random forest m = 2 0.8579669
## 11        Neural net 3 0.8576715
## 12             Bagging 0.8568759
## 13      Non-linear SVM 0.8518311

Among all the methods we have used, boosting turned out to be the best performing in terms of AUC, with a value of about 0.882, so we used that model to generate our final predictions.

The boosting submission scored 0.79448 on kaggle.com.